ABSTRACT

Minwise hashing is a standard technique in the context of search for
approximating set similarities. The recent work [27] demonstrated
a potential use of b-bit minwise hashing [26] for batch learning on
large data. However, several critical issues must be tackled before
one can apply b-bit minwise hashing to the volumes of data often
used in industrial applications, especially in the context of search.

(b-bit) Minwise hashing requires an expensive preprocessing step
that computes k (e.g., 500) minimal values after applying the
corresponding permutations for each data vector. Note that the required
k is often substantially larger for classification tasks than for
duplicate detection (which mainly concerns highly similar pairs). We
developed a parallelization scheme using GPUs and observed that
the preprocessing time can be reduced by a factor of 20 to 80 and
becomes substantially smaller than the data loading time. Reducing
the preprocessing time is highly beneficial in practice, e.g., for
duplicate Web page detection (where minwise hashing is a major
step in the crawling pipeline) or for increasing the testing speed of
online classifiers.

One major advantage of b-bit minwise hashing is that it can
substantially reduce the amount of memory required for batch
learning. However, as online algorithms become increasingly popular
for large-scale learning in the context of search, it is not clear
whether b-bit minwise hashing yields significant improvements for
them. This paper demonstrates that b-bit minwise hashing provides an
effective data size/dimension reduction scheme and hence can
dramatically reduce the data loading time for each epoch of the online
training process. This is significant because online learning often
requires many (e.g., 10 to 100) epochs to reach sufficient accuracy.

Another critical issue is that for very large data sets it becomes
impossible to store a (fully) random permutation matrix, due to its
space requirements. Our paper is the first study to demonstrate that
b-bit minwise hashing implemented using simple hash functions,
e.g., the 2-universal (2U) and 4-universal (4U) hash families, can
produce learning results very similar to those obtained with fully
random permutations. Experiments on datasets of up to 200GB are
presented.

1. INTRODUCTION 

Minwise hashing [6, 8] is a standard technique for efficiently
computing set similarities in the context of search, with further
applications in the context of content matching for online
advertising [29], detection of redundancy in enterprise file systems [17],
syntactic similarity algorithms for enterprise information
management [11], Web spam [36], etc. The recent development of b-bit
minwise hashing [26] provided a substantial improvement in
the estimation accuracy and speed by proposing a new estimator
that stores only the lowest b bits of each hashed value. More
recently, [27] proposed the use of b-bit minwise hashing in the
context of learning algorithms such as SVM or logistic regression on
large binary data (which is typical in Web classification tasks).
b-bit minwise hashing can enable scalable learning where otherwise
massive (and expensive) parallel architectures would have been
required, at a negligible reduction in learning quality. In [27],
experiments showed this for the webspam dataset, which has 16 million
features with a total disk size of 24GB in standard LibSVM format.

However, several crucial issues must be tackled before one can
apply b-bit minwise hashing to industrial applications. To
understand these issues, we begin with a review of the method.

1.1 A Review of b-Bit Minwise Hashing

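Minwise hashing mainly focuses on binary (0/1) data, which can
be viewed as sets. Consider sets $S_1, S_2 \subseteq \Omega = \{0, 1, 2, \ldots, D-1\}$.
Minwise hashing applies a random permutation $\pi : \Omega \to \Omega$ on $S_1$
and $S_2$ and uses the following collision probability

$$\Pr\left(\min(\pi(S_1)) = \min(\pi(S_2))\right) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = R \qquad (1)$$

to estimate R, which is the resemblance between $S_1$ and $S_2$. With
k permutations $\pi_1, \ldots, \pi_k$, one can estimate R without bias:

$$\hat{R}_M = \frac{1}{k} \sum_{j=1}^{k} 1\left\{\min(\pi_j(S_1)) = \min(\pi_j(S_2))\right\}. \qquad (2)$$

Here we write $z_1 = \min(\pi_j(S_1))$ and $z_2 = \min(\pi_j(S_2))$ for the two
hashed values under permutation $\pi_j$.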
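The estimator in (2) can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this paper: the function names, the toy sets, D = 1000, and the explicit permutation lists are assumptions chosen only so that the example runs quickly.

```python
import random

def minhash_signature(s, perms):
    """Return min(pi_j(S)) for every permutation pi_j (each pi_j is a list)."""
    return [min(pi[t] for t in s) for pi in perms]

def estimate_resemblance(sig1, sig2):
    """Estimator (2): fraction of permutations whose minima collide."""
    matches = sum(1 for z1, z2 in zip(sig1, sig2) if z1 == z2)
    return matches / len(sig1)

def resemblance(s1, s2):
    """Exact resemblance R = |S1 & S2| / |S1 | S2|."""
    return len(s1 & s2) / len(s1 | s2)

rng = random.Random(0)          # fixed seed, for reproducibility only
D, k = 1000, 500                # toy dimension and number of permutations
perms = []
for _ in range(k):
    pi = list(range(D))
    rng.shuffle(pi)             # a fully random permutation of {0, ..., D-1}
    perms.append(pi)

S1 = set(range(0, 300))
S2 = set(range(150, 450))
R = resemblance(S1, S2)         # 150 / 450
R_hat = estimate_resemblance(minhash_signature(S1, perms),
                             minhash_signature(S2, perms))
print(R, R_hat)
```

Because each indicator in (2) is a Bernoulli variable with mean R, the standard deviation of the estimate is sqrt(R(1 - R)/k), roughly 0.02 for these toy values, so the estimate should land close to the exact resemblance.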



A common practice is to store each hashed value, e.g., $\min(\pi(S_1))$,
using 64 bits [16]. The storage (and computational) cost is
prohibitive in industrial applications [28]. The recent work on b-bit
minwise hashing [26] provides a simple solution by storing only
the lowest b bits of each hashed value. For convenience, we define

$$z_1^{(b)} = \text{the lowest } b \text{ bits of } z_1, \qquad z_2^{(b)} = \text{the lowest } b \text{ bits of } z_2.$$



Theorem 1. [26] Assume D is large. Then

$$P_b = \Pr\left(z_1^{(b)} = z_2^{(b)}\right) = C_{1,b} + (1 - C_{2,b})\, R, \qquad (3)$$

where $C_{1,b}$ and $C_{2,b}$ are functions of $(D, |S_1|, |S_2|, |S_1 \cap S_2|)$. □

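Based on Theorem 1, we can estimate $P_b$ (and R) from k independent
permutations $\pi_1, \pi_2, \ldots, \pi_k$:

$$\hat{R}_b = \frac{\hat{P}_b - C_{1,b}}{1 - C_{2,b}}, \qquad \hat{P}_b = \frac{1}{k} \sum_{j=1}^{k} 1\left\{z_{1,\pi_j}^{(b)} = z_{2,\pi_j}^{(b)}\right\}, \qquad (4)$$

where $z_{1,\pi_j}^{(b)}$ and $z_{2,\pi_j}^{(b)}$ are the lowest b bits of the two hashed
values under $\pi_j$.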
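As a hedged illustration of the estimator in (4), the Python sketch below computes the empirical collision rate of the lowest b bits and then applies the linear correction. The helper names and the two toy signatures are hypothetical. The constants c1 and c2 are passed in by the caller; the value 2^-b used in the example is an assumption made only to keep the numbers concrete, since Theorem 1 states that the true constants depend on (D, |S1|, |S2|, |S1 ∩ S2|).

```python
def lowest_b_bits(z, b):
    """Keep only the lowest b bits of a hashed value."""
    return z & ((1 << b) - 1)

def estimate_p_b(sig1, sig2, b):
    """Empirical P_b: fraction of positions whose lowest b bits agree."""
    hits = sum(1 for z1, z2 in zip(sig1, sig2)
               if lowest_b_bits(z1, b) == lowest_b_bits(z2, b))
    return hits / len(sig1)

def estimate_r_b(sig1, sig2, b, c1, c2):
    """Estimator (4): (P_b - C_{1,b}) / (1 - C_{2,b})."""
    return (estimate_p_b(sig1, sig2, b) - c1) / (1.0 - c2)

# Toy signatures (k = 4 hashed values per set), purely illustrative.
sig1 = [12, 7, 33, 40]
sig2 = [12, 5, 33, 41]

b = 1
c = 2.0 ** -b                    # assumed value of C_{1,b} = C_{2,b}
p_hat = estimate_p_b(sig1, sig2, b)
r_hat = estimate_r_b(sig1, sig2, b, c, c)
print(p_hat, r_hat)
```

Keeping c1 and c2 as explicit arguments makes it clear that the correction step is separate from the hashing itself.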



The estimator $\hat{P}_b$ is an inner product between two vectors in $2^b \times k$
dimensions with exactly k 1's, because

$$1\left\{z_1^{(b)} = z_2^{(b)}\right\} = \sum_{t=0}^{2^b - 1} 1\left\{z_1^{(b)} = t\right\} \times 1\left\{z_2^{(b)} = t\right\}. \qquad (5)$$

This provides a practical strategy for using b-bit minwise hashing
for large-scale learning. That is, each original data vector is
transformed into a new data point consisting of k b-bit integers, which
is expanded into a binary vector of length $2^b \times k$ at run-time.
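The expansion step can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used for the experiments; the function names and the toy inputs are assumptions. Each of the k b-bit values selects one position inside its own block of 2^b coordinates, so the inner product of two expanded vectors counts the positions where the b-bit values agree, as in (5).

```python
def expand(sig_bits, b):
    """Map k b-bit values to a binary vector of length 2^b * k with k ones."""
    width = 1 << b
    vec = [0] * (width * len(sig_bits))
    for j, z in enumerate(sig_bits):
        vec[j * width + z] = 1   # one-hot inside the j-th block
    return vec

def inner(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

b = 2
x1 = [0, 3, 1, 0]                # k = 4 values, each in {0, ..., 2^b - 1}
x2 = [0, 1, 1, 1]
v1, v2 = expand(x1, b), expand(x2, b)

matches = sum(1 for a, c in zip(x1, x2) if a == c)
print(inner(v1, v2), matches)
```

In a linear learner the expanded vector never has to be stored densely: only the k active indices j * 2^b + z are needed.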

These days, many machine learning applications, especially in
the context of search, are faced with large and inherently
high-dimensional datasets. For example, [35] discusses training datasets
with (on average) $n = 10^{11}$ items and $D = 10^{9}$ distinct features.
[37] experimented with a dataset of potentially D = 16 trillion
($1.6 \times 10^{13}$) unique features. Effective algorithms for data/feature
reduction will be highly beneficial for these industry applications.

1.2 Linear Learning Algorithms 

Clearly, b-bit minwise hashing can approximate both linear and
nonlinear kernels (if they are functions of the inner products). We
focus on linear learning because many high-dimensional datasets
used in the context of search are naturally suitable for linear
algorithms. Realistically, for industrial applications, "almost all the big
impact algorithms operate in pseudo-linear or better time" [1].

Linear algorithms such as linear SVM and logistic regression
have become very powerful and extremely popular. Representative
software packages include SVM^perf [21], Pegasos [32], Bottou's
SGD SVM [4], and LIBLINEAR [15].

Given a dataset $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^D$, $y_i \in \{-1, 1\}$, the
L2-regularized linear SVM solves the following optimization problem:

$$\min_{w} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \max\left\{1 - y_i w^T x_i,\; 0\right\}, \qquad (6)$$

and the L2-regularized logistic regression solves a similar problem:

$$\min_{w} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i w^T x_i}\right). \qquad (7)$$

Here C > 0 is an important penalty parameter.
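To make the two objectives concrete, the following Python sketch evaluates (6) and (7) for a given weight vector. It only computes the objective values and is not a solver; the function names and the two-point toy dataset are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def svm_objective(w, data, C):
    """Objective (6): 0.5 * w'w + C * sum of hinge losses."""
    hinge = sum(max(1.0 - y * dot(w, x), 0.0) for x, y in data)
    return 0.5 * dot(w, w) + C * hinge

def logistic_objective(w, data, C):
    """Objective (7): 0.5 * w'w + C * sum of logistic losses."""
    loss = sum(math.log(1.0 + math.exp(-y * dot(w, x))) for x, y in data)
    return 0.5 * dot(w, w) + C * loss

# Toy data: (feature vector, label) pairs with labels in {-1, +1}.
data = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]

print(svm_objective([0.0, 0.0], data, C=1.0))
print(svm_objective([1.0, -1.0], data, C=1.0))
print(logistic_objective([0.0, 0.0], data, C=1.0))
```

A larger C puts more weight on the loss term relative to the regularizer, which is why C is swept over a range of values in the experiments.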

Next, we elaborate on 3 major issues one must tackle in order to
apply b-bit minwise hashing to large-scale industrial applications.

1.3 Issue 1: Expensive Preprocessing 

(b-bit) Minwise hashing requires a very expensive preprocessing
step (substantially more costly than loading the data) in order
to compute k (e.g., k = 500) minimal values (after permutation)
for each data vector. Note that in prior studies for duplicate
detection [6], k was usually not too large (e.g., 200), mainly because
duplicate detection concerns highly similar pairs (e.g., R > 0.5).
With b-bit minwise hashing, we have to use larger k values according
to the analysis in [26], even in the context of duplicate detection.
Note that classification tasks are quite different from duplicate
detection. For example, in our most recent experiments on
image classification [24], even k = 2000 did not appear to be
sufficient.

Consider, for example, computing b-bit minwise hash signatures
for Web page duplicate detection. While parallelizing this task is
conceptually simple (as the signatures of different pages can be
computed independently), it still comes at the cost of using
additional hardware and electricity. Thus, any improvements in the
speed of signature computation may be directly reflected in the cost
of the required infrastructure.



For machine learning research and applications, this expensive 
preprocessing step can be a significant issue in scenarios where 
(either due to changing data distributions or features) models are 
frequently re-trained. In user-facing applications, the testing time 
performance can be severely affected by the preprocessing step if 
the (new) incoming data have not been previously processed. 

This paper studies how to speed up the execution of the signature
computation through the use of graphics processing units (GPUs).
GPUs offer, compared to current CPUs, higher instruction
parallelism and very low latency access to the internal GPU memory,
but comparatively slow latencies when accessing the main
memory [22]. As a result, many data processing algorithms (especially
those with random memory access patterns) do not benefit
significantly when implemented using a GPU. However, the
characteristics of the minwise hashing algorithm make it very well suited
for execution using a GPU. The algorithm accesses each set in its
entirety, which allows for the use of main memory pre-fetching to
reduce access latencies. Moreover, since we compute k different
hash minima for each item in a set, the algorithm can make good
use of the high degree of parallelism in GPUs. This is especially
true for b-bit minwise hashing, which, compared to the original
algorithm, typically increases the number of hash functions (and
minima) to be computed by a factor of 3 or more. Also, note that
any improvements to the speed of (b-bit) minwise hashing are
directly applicable to large-scale instances of the other applications
of minwise hashing mentioned previously.

1.4 Issue 2: Online Learning 

Online learning has become increasingly popular in the context
of web search and advertising [3, 12, 23, 31] as it only requires
loading one feature vector at a time and thus avoids the overhead of
storing a potentially very large dataset in memory or the complexity
and cost of parallel learning architectures. In this paper, we
demonstrate that b-bit minwise hashing can also be highly beneficial
for online learning because reducing the data size substantially
decreases the loading time, which usually dominates the training
cost, especially when many training epochs are needed. At the
same time, the resulting reduction in the overall learning accuracy
that we see in the experiments is negligible. Moreover, b-bit
minwise hashing can also serve as an effective dimensionality
reduction scheme here, which can be important in the context of web
search, as (i) machine learning models based on text n-gram
features generally have very large numbers of features, and thus
require significant storage, and (ii) there are typically a number of
different models deployed in user-facing search servers, all of which
compete for the available main memory space.

There is another benefit. Machine learning researchers have been
actively developing good (and sometimes complex) online
algorithms to minimize the need for loading the data many times
(e.g., [38]). If the data loading time is no longer a major bottleneck,
then perhaps very simple online algorithms may be sufficient in
practice.

1.5 Issue 3: Massive Permutation Matrix 

When the data dimension (D) is not too large, e.g., millions, the
implementation of b-bit minwise hashing for learning is
straightforward. Basically, we can assume a "fully random permutation
matrix" of size D × k, which defines k permutation mappings. This
is actually how researchers use simulations (e.g., in Matlab) to verify
the theoretical results assuming perfectly random permutations.

Unfortunately, when the dimension is on the order of billions (let
alone $2^{64}$), it becomes impractical (or too expensive) to store such
a permutation matrix. Thus, we have to resort to simple hash
functions such as various forms of 2-universal (2U) hashing (e.g., [13]).
Now the question is how reliable those hash functions are in the
context of learning with b-bit minwise hashing.

There were prior studies on the impact of limited randomness
on the estimation accuracy of (64-bit) minwise hashing, e.g., [20, 30].
However, no prior studies reported how the learning accuracies
were affected by the use of simple hash functions for b-bit minwise
hashing. This study provides empirical support that, as long as
the data are reasonably sparse (as is virtually always the case in the
context of search), using 2U/4U hash functions results in a negligible
reduction of learning accuracies (unless b = 1 and k is very small).

One limitation of GPUs is that they have fairly limited
memory [2]. Thus, it becomes even more beneficial if we can reliably
replace a massive permutation matrix with simple hash functions.

2. SIMPLE HASH FUNCTIONS 

As previously discussed, in large-scale industry practice, it is
often infeasible to assume perfect random permutations. For
example, when $D = 2^{30}$ (about 1 billion) and k = 500, a matrix of
D × k integers (4 bytes each) would require > 2000GB of storage.
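The storage figure follows from a one-line calculation, written out here as a worked example under the stated assumption of 4 bytes per matrix entry:

```latex
\begin{align*}
D \times k \times 4 \text{ bytes}
  &= 2^{30} \times 500 \times 4 \text{ bytes} \\
  &= 2000 \times 2^{30} \text{ bytes} \\
  &= 2{,}147{,}483{,}648{,}000 \text{ bytes}
   \approx 2.15 \times 10^{12} \text{ bytes},
\end{align*}
```

which is 2000 GB when 1 GB is counted as $2^{30}$ bytes and about 2147 GB when 1 GB is counted as $10^{9}$ bytes.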

To overcome the difficulty in achieving perfect permutations, the
common practice is to use so-called universal hashing [9]. One
standard 2-universal (2U) hash function is, for j = 1 to k,

$$h_j(t) = \{a_{1,j} + a_{2,j}\, t \bmod p\} \bmod D, \qquad (8)$$

where p > D is a prime number and $a_{1,j}$, $a_{2,j}$ are chosen
uniformly from $\{0, 1, \ldots, p-1\}$. To increase randomness, one can
also use the following 4-universal (4U) hash function:

$$h_j^{(4U)}(t) = \left\{\sum_{i=1}^{4} a_{i,j}\, t^{i-1} \bmod p\right\} \bmod D, \qquad (9)$$

where the $a_{i,j}$ (i = 1, 2, 3, 4) are chosen uniformly from
$\{0, 1, \ldots, p-1\}$. The storage cost for retaining the $a_{i,j}$'s is
minimal compared to storing a permutation matrix. In theory, the 4U
hash function is (in the worst case) more random than the 2U hash
function.

Now, to compute the minwise hashes for a given feature vector
(e.g., a parsed document represented as a list of 1-grams, 2-grams,
and 3-grams, where each n-gram can be viewed as a binary
feature), we iterate over all non-zero features; any non-zero location t
in the original feature vector is mapped to its new location $h_j(t)$;
we then iterate over all mapped locations to find their minimum,
which will be the jth hashed value for that feature vector.
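The procedure just described can be sketched in Python using the 2U hash function (8). This is a minimal illustration, not the implementation measured in this paper: the function names, the choice p = 2^31 - 1, the toy dimension D = 2^20, k = 8, and the two toy documents are all assumptions.

```python
import random

def make_2u_params(k, p, rng):
    """Draw k coefficient pairs (a1, a2) uniformly from {0, ..., p-1}."""
    return [(rng.randrange(p), rng.randrange(p)) for _ in range(k)]

def minhash_2u(nonzeros, params, p, D):
    """For each hash function, map every non-zero location t via (8)
    and keep the minimum mapped location."""
    signature = []
    for a1, a2 in params:
        signature.append(min(((a1 + a2 * t) % p) % D for t in nonzeros))
    return signature

P = 2147483647                   # assumed prime p = 2^31 - 1 (> D)
D = 1 << 20                      # assumed toy dimension
rng = random.Random(1)           # fixed seed, for reproducibility only
params = make_2u_params(8, P, rng)

doc_a = [3, 17, 4096, 99999]     # non-zero feature locations
doc_b = [17, 4096, 500000]
sig_a = minhash_2u(doc_a, params, P, D)
sig_b = minhash_2u(doc_b, params, P, D)
print(sig_a)
print(sig_b)
```

Only the 2k coefficients are stored, regardless of D, which is the storage advantage over a permutation matrix noted above.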

3. GPU FOR FAST PREPROCESSING 

In this section we describe and analyze the use of graphics
processors (GPUs) for fast computation of minwise hashes. We
first sketch the relevant properties of GPUs in general and then
describe to what extent minwise hashing computation is suited for
execution on this architecture. Subsequently, we describe our
implementation and analyze the resulting performance improvements
over a CPU-based implementation.

3.1 Introduction 

The use of GPUs as general-purpose coprocessors is relatively
recent and primarily due to their high computational power at
comparatively low cost. In comparison with commodity CPUs, GPUs
offer significantly increased computation speed and memory
bandwidth. However, since GPUs have been designed for graphics
processing, the programming model (which includes massively
parallel Single-Instruction-Multiple-Data (SIMD) processing and
limited bus speeds for data transfers to/from main memory) is not
suitable for arbitrary data processing applications [19]. GPUs consist
of a number of SIMD multiprocessors. At each clock cycle, all
processors in a multiprocessor execute identical instructions, but on
different parts of the data. Thus, GPUs can leverage spatial locality
in data access and group accesses to consecutive memory addresses
into a single access; this is referred to as coalesced access.

3.2 Our Approach 

In light of the properties of GPU processing, our GPU algorithm
to compute b-bit minwise hashes proceeds in 3 distinct phases:
First, we read chunks of 10K sets from disk into main memory
and write these to the GPU memory. Then, we compute the hash
values and the corresponding minima by applying all k hash
functions to the data currently in the GPU and retaining, for each hash
function and set, the corresponding minimum. Finally, we write
the resulting minima back to main memory and repeat the process.

This batch-style computation has a number of advantages. Be- 
cause we transfer larger blocks of data, the main memory latency is 
reduced through the use of main memory pre-fetching. Moreover, 
because the computation within the GPU itself scans through con- 
secutive blocks of data in the GPU-internal memory (as opposed to 
random memory access patterns), performing the same computa- 
tion (with a different hash function) for each set entry k times, we 
can take full advantage of coalesced access and the massive paral- 
lelism inherent in the GPU architecture. 

Because GPUs are known to have fairly limited memory
capacity, it becomes even more impractical to store a fully random
permutation matrix; hence it is crucial to utilize simple hash
functions. We implemented both the 2U and 4U hash functions introduced
in Section 2. However, because the modulo operations in the
definitions of the 2U/4U hash functions are expensive, especially for
GPUs [2], we have used the following tricks to avoid them and
make our approach (more) suitable for GPU-based execution.

3.3 Avoid Modulo Operations in 2U Hashing 

To avoid the modulo operations in 2U hashing, we adopt a common
trick [14]. Here, for simplicity, we assume $D = 2^s < 2^{32}$
(note that $D = 2^{30}$ corresponds to about a billion features). It is
known that the following hash function is essentially 2U [14]:

$$h_j^{(s)}(t) = \{a_{1,j} + a_{2,j}\, t \bmod 2^{32}\} \bmod 2^s, \qquad (10)$$

where $a_{1,j}$ is chosen uniformly from $\{0, 1, \ldots, 2^{32}-1\}$ and $a_{2,j}$
uniformly from $\{1, 3, \ldots, 2^{32}-1\}$ (i.e., $a_{2,j}$ is odd). This scheme
is much faster because we can effectively leverage the integer
overflow mechanism and the efficient bit-shift operation. In this paper,
we always implement 2U hashing using $h_j^{(s)}$.
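A hedged Python sketch of (10) is given below. Python integers do not overflow, so the reduction modulo 2^32 that unsigned 32-bit arithmetic performs implicitly is written here as an explicit mask. The function name and the sample coefficients are hypothetical.

```python
MASK32 = (1 << 32) - 1           # keeps the low 32 bits, i.e. mod 2^32

def hash_2u_pow2(t, a1, a2, s):
    """Equation (10) without the modulo operator:
    ((a1 + a2 * t) mod 2^32) mod 2^s, computed with bit masks only."""
    wrapped = (a1 + a2 * t) & MASK32      # emulates 32-bit wrap-around
    return wrapped & ((1 << s) - 1)       # mod 2^s as a mask

a1 = 123456789                   # assumed coefficient in {0, ..., 2^32 - 1}
a2 = 2654435761                  # assumed odd coefficient
s = 30                           # D = 2^30

for t in (0, 1, 1000003, 2 ** 30 - 1):
    print(t, hash_2u_pow2(t, a1, a2, s))
```

In a language with fixed-width unsigned integers the first mask disappears entirely, which is where the speed advantage over (8) comes from.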

3.4 Avoid Modulo Operations in 4U Hashing 

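It is slightly tricky to avoid the modulo operations in evaluating
4U hash functions. Assuming $D < p = 2^{31} - 1$ (a prime number),
we provide the C# code to compute v mod p with $p = 2^{31} - 1$:

    private static ulong BitMod(ulong v)
    {
        ulong p = 2147483647; // p = 2^31 - 1
        // Fold the bits above bit 30 onto the low 31 bits (2^31 = 1 mod p).
        v = (v >> 31) + (v & p);
        // A second fold is needed when the first result is still >= 2p.
        if (v >= 2 * p)
            v = (v >> 31) + (v & p);
        // At this point v < 2p, so one subtraction completes the reduction.
        if (v >= p)
            return v - p;
        else
            return v;
    }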



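To better understand the code, consider

$$v \bmod p = x, \quad \text{and} \quad v \bmod 2^{31} = y$$
$$\Longrightarrow\; v = p \times Z + x = 2^{31} \times S + y$$
$$\Longrightarrow\; x = 2^{31}(S - Z) + Z + y$$

for two integers S and Z. S and y can be efficiently evaluated using
bit operations: S = v >> 31 and y = v & p. Since $2^{31} \equiv 1 \pmod{p}$,
the last line gives $x \equiv S + y \pmod{p}$, which is the quantity each
folding step in the code computes.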
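The folding argument can be checked with a small Python port of the routine, followed by a 4U hash built on top of it. This is a sketch under assumptions: the names bit_mod and hash_4u are hypothetical, evaluating the degree-3 polynomial in (9) with Horner's rule is a choice made here rather than something stated in the text, and the sample coefficients are arbitrary.

```python
P31 = (1 << 31) - 1              # p = 2^31 - 1

def bit_mod(v):
    """v mod (2^31 - 1) for 0 <= v < 2^64, using shifts and masks only."""
    v = (v >> 31) + (v & P31)    # first fold: S + y
    if v >= 2 * P31:
        v = (v >> 31) + (v & P31)  # second fold for large inputs
    return v - P31 if v >= P31 else v

def hash_4u(t, coeffs, D):
    """Equation (9) with p = 2^31 - 1, evaluated by Horner's rule (an
    assumption of this sketch). Requires 0 <= t < p and coefficients < p,
    so every intermediate value stays below 2^64."""
    a1, a2, a3, a4 = coeffs
    acc = a4
    for a in (a3, a2, a1):
        acc = bit_mod(acc * t + a)
    return acc % D

coeffs = (11, 2000003, 1234567890, 987654321)   # assumed a1..a4, all < p
D = 1 << 20                                     # assumed toy dimension
print(bit_mod(2 ** 64 - 1), hash_4u(1000003, coeffs, D))
```

Reducing after every multiply-add keeps all intermediate products within 64 bits, so the same structure carries over to fixed-width unsigned arithmetic.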

A recent paper [34] implemented a similar trick for $p = 2^{61} - 1$,
which was simpler than ours because with $p = 2^{61} - 1$ there is no
need to check the condition "if (v >= 2 * p)". We find the case of
$p = 2^{31} - 1$ useful in machine learning practice because it suffices
for datasets with fewer than a billion features. Note that a large value
of p potentially increases the dimensionality of the hashed data.

3.5 Experiments: Datasets 

Table 1 summarizes the two datasets used in this evaluation:
webspam and rcv1. The webspam dataset was used in the recent
paper [27]. Since the webspam dataset (24 GB in LibSVM format)
may be too small compared to datasets used in industrial practice,
in this paper we also present an empirical study on the expanded
rcv1 dataset [4], which we generated by using the original features
+ all pairwise combinations (products) of features + 1/30 of 3-way
combinations (products) of features. Note that, for rcv1, we did not
include the original test set in [4], which has only 20242 examples.
To ensure reliable test results, we randomly split our expanded rcv1
dataset into two halves, for training and testing.

Table 1: Data information

Dataset            n        D            # Avg Nonzeros   Train / Test
Webspam (24 GB)    350000   16609143     3728             80% / 20%
Rcv1 (200 GB)      781265   1010017424   12062            50% / 50%



Table 2: The data loading and preprocessing (for k = 500
permutations) times (in seconds). Note that we measured the data
loading times of LIBLINEAR, which used a plain text data
format. The data loading times could be reduced by a factor of 5
or so when the data were converted into binary. In other words,
the (relative) preprocessing costs of minwise hashing would be
even higher if we optimized the data loading
procedure of LIBLINEAR. This further explains why reducing
the cost by using GPUs could be so beneficial.

Dataset    Loading      Permu        2U           4U (Mod)     4U (Bit)
Webspam    9.7 × 10^2   6.1 × 10^3   4.1 × 10^3   4.4 × 10^4   1.4 × 10^4
Rcv1       1.0 × 10^4   -            3.0 × 10^4   -            -



3.6 Experiments: Platform

The GPU platform we use in our experiments is the NVIDIA
Tesla C2050, which has 15 Streaming Multiprocessors (SMs),
each with 2 groups of 16 scalar processors (hence 2 sets of
16-element wide SIMD units). The peak (single precision) GFlops of
this GPU are 1030, with a peak memory bandwidth of 144 GB/s. In
comparison, the numbers for an Intel Xeon X5670 (Westmere)
processor are 278 GFlops and 60 GB/s.

3.7 Experiments: CPU Results

We use the setting of k = 500 for these experiments. Table 2
shows the overhead of the CPU-based implementation, broken down
into the time required to load the data into memory and the time for
the minwise hashing computation. For 2U, we always use the 2U
hash function (10). For 4U (Mod), we use the 4U hash function (9),
which requires the modulo operation. For 4U (Bit), we use the
implementation in Section 3.4, which converts the modulo operation
into bit operations. Note that for the rcv1 dataset, we only report the
experimental results for 2U hashing.

Table 2 shows that the preprocessing using CPUs (even for 2U)
can be very expensive, substantially more than data loading. 4U
hashing with modulo operations can take an order of magnitude
more time than 2U hashing. As expected, the cost for 4U hashing
can be substantially reduced if modulo operations are avoided.

Note that for the webspam dataset (with only 16 million features),
using permutations is actually slightly faster than the algebra
required for 2U hash functions. The main constraint here is the
storage space. The permutations are generated once and then stored
in main memory. This makes them impractical for use with larger
feature sets such as the rcv1 data (with about a billion features).

3.8 Experiments: GPU Results

The total overhead for the GPU-based processing for batch size
= 10K is summarized in Table 3, demonstrating the substantial time
reduction compared to the CPU-based processing in Table 2. For
example, the cost of 2U processing on the webspam dataset is
reduced from 4100 seconds to 51 seconds, an 80-fold reduction. We
also observe improvements of similar magnitude for 4U processing
(both modulo and bit versions) on webspam. For the rcv1 dataset,
the time reduction of the GPU-based implementation is about
20-fold, compared to the CPU-based implementation.

For both datasets, the costs for the GPU-based preprocessing
become substantially smaller than the data loading time. Thus, while
achieving further reduction of the preprocessing cost is still
interesting, it becomes less practically significant because the data
have to be loaded at least once in the learning process.


Table 3: The data loading and preprocessing (for k = 500
permutations) times (in seconds) when using GPUs.

Dataset    Loading      GPU 2U       GPU 4U (Mod)   GPU 4U (Bit)
Webspam    9.7 × 10^2   51           5.2 × 10^2     1.2 × 10^2
Rcv1       1.0 × 10^4   1.4 × 10^3   1.5 × 10^4     3.2 × 10^3



Figures 1 to 3 provide the breakdowns of the overhead for the
GPU-based implementations, using 2U hashing, 4U hashing with
modulo operations, and 4U hashing without modulo operations,
respectively. As shown in the figures, we separate the overhead into
three components: (i) time spent transferring the data from main
memory to the GPU ("CPU -> GPU"), (ii) the actual computation
("GPU Kernel"), and (iii) transferring the k minima back to main
memory ("GPU -> CPU").

Figure 1: 2U. Overhead of the three phases of the GPU-based
implementation using 2U hash functions, for both webspam (left
panel) and rcv1 (right panel) datasets.

Figure 2: 4U-Mod. Overhead of the three phases of the GPU-based
implementation using 4U hash functions with modulo
operations, for both datasets.

Figure 3: 4U-Bit. Overhead of the three phases of the GPU-based
implementation using 4U hash functions without modulo
operations (Section 3.4), for both datasets.



Recall that we adopt a batch-based GPU implementation, reading
chunks of sets from disk into main memory and writing them to the
GPU memory. If the chunk size is not chosen appropriately, it may
affect the GPU performance. Thus, we vary this parameter and
report the performance for chunk sizes ranging from 1 to 50000.

We can see from Figures 1 to 3 that the overall cost is not
significantly affected by the batch size, as long as it is reasonably large
(e.g., in this case > 100). This nice property may simplify the
design because practitioners will not have to try out many batch
sizes. Note that the time spent in transferring the data to the GPU
is not affected significantly by the batch size, but the speed at which
data is transferred back does vary significantly with this parameter.
For any setting of the batch size, however, the time spent
transferring data is about two orders of magnitude smaller
than the time spent on the actual processing within the GPU. This
is the key to the large speed-up over CPU implementations we see.

4. VALIDATION OF THE USE OF 2U/4U HASH 
FUNCTIONS FOR LEARNING 

For large-scale industrial applications, because storing a fully
random permutation matrix is not practical, we have to resort to
simple hash functions such as the 2U or 4U hash families. However,
before we can recommend them to practitioners, we must first
validate on a smaller dataset that using such hash functions will not
hurt the learning performance. To the best of our knowledge, this
section is the first empirical study of the impact of hashing
functions on machine learning with b-bit minwise hashing.

In addition, Appendix A provides another set of experiments for
estimating resemblances using b-bit minwise hashing with simple
hash functions. Those experiments demonstrate that, as long as the
data are not too dense, using 2U hashing produces estimates very
similar to those obtained with fully random permutations. That set of
experiments may help understand the experimental results in this
section. Note that both datasets listed in Table 1 are extremely sparse.

The webspam dataset is small enough (24GB and 16 million
features) that we can conduct experiments using a permutation matrix.
We chose LIBLINEAR as the underlying learning procedure. All
experiments were conducted on workstations with Xeon(R) CPU
(W5590@3.33GHz) and 48GB RAM, on a Windows 7 system.

4.1 Experimental Results 

We experimented with both 2U and 4U hash schemes for training
linear SVM and logistic regression. We tried out 30 values for
the regularization parameter C in the interval $[10^{-3}, 100]$. We
experimented with 11 k values from k = 10 to k = 500, and with 7 b
values: b = 1, 2, 4, 6, 8, 10, 16. Each experiment was repeated 50
times. The total number of training experiments turns out to be

2 × 2 × 30 × 11 × 7 × 50 = 462000.

To maximize repeatability, whenever page space allows, we
prefer to present the detailed experimental results for
all the parameters instead of, for example, only reporting the best
results or the cross-validated results. In this subsection, we only
present the results for the webspam dataset using linear SVM because
the results for logistic regression lead to the same conclusion.

Figure 4 presents the SVM test accuracies (averaged over 50
runs). For the test cases that correspond to the most likely
parameter settings used in practice (e.g., k > 200 and b > 4), we can see
that the results from the three hashing schemes (full permutations,
2U, and 4U) are essentially identical. Overall, it appears that 4U is
slightly better than 2U when b = 1 or k is very small.

This set of experiments provides strong experimental
evidence that the simple and highly efficient 2U hashing scheme
may be sufficient in practice, when used in the context of
large-scale machine learning using b-bit minwise hashing.

4.2 Experimental Results on VW Algorithm 

The application domains of b-bit minwise hashing are limited
to binary data, whereas other hashing algorithms such as random
projections and the Vowpal Wabbit (VW) hashing algorithm [33, 37]
are not restricted to binary data. It is shown in [27] that VW and
random projections have essentially the same variances as reported
in [25]. In this study, we also provide some experiments on the VW
hashing algorithm because we expect that practitioners will find
applications which are more suitable for VW than b-bit minwise
hashing. Since there is not enough space to explain the details of the
VW algorithm, interested readers may refer to [27, 33, 37].

In particular, in this subsection, we present a comparison
between the accuracy of VW when using a fully random
implementation and when using 2U hash functions. Figure 5 presents the
experimental results for both linear SVM and logistic regression. As
we can see, the two graphs are nearly identical, meaning that the
good performance of 2U hashing seen earlier is not limited to the
context of learning with b-bit minwise hashing.

5. LEARNING ON RCV1 DATA (200GB) 

Compared to webspam, the size of the expanded rcv1 dataset
may be closer to the training data sizes used in industrial
applications. We report the experiments on linear SVM and logistic
regression, as well as the comparisons with the VW hash algorithm.

5.1 Experiments on Linear SVM 

Figure 6 and Figure 7 respectively provide the test accuracies and
training times for linear SVM. We cannot report the baseline
because the original dataset exceeds the memory capacity. Using
merely k = 30 and b = 12, our method can achieve > 90% test
accuracies. With k > 300, we can achieve > 95% test accuracies.

Figure 4: Linear SVM test accuracies on webspam, using three
hashing schemes: permutations, 2U, and 4U. Both 2U and 4U
perform well as the curves essentially overlap except when b =
1 or k is small. It appears that 4U is only slightly better than 2U.

Figure 5: Linear SVM (left panel) and logistic regression
(right panel) test accuracies on webspam using full randomness
("FULL") as well as the 2U hash scheme ("2U"). We can see that
the test accuracies are not affected much by using 2U hashing.



5.2 Experiments on Logistic Regression 

Figure 8 and Figure 9 respectively present the test accuracies
and training times for training logistic regression. Again, using
merely k = 30 and b = 12, our method can achieve > 90% test
accuracies. With k > 300, we can achieve > 95% test accuracies.

To help understand the significance of these results, next we
provide a comparison study with the VW hashing algorithm [33, 37].

Figure 6: Linear SVM test accuracy on rcv1.

Figure 7: Linear SVM training time on rcv1.

Figure 8: Logistic regression test accuracy on rcv1.

Figure 9: Logistic regression training time on rcv1.

5.3 Comparisons with VW Algorithm

The Vowpal Wabbit (VW) algorithm [33, 37] is an influential
hashing method for data/dimension reduction. Since [27] only
compared b-bit minwise hashing with VW on a small dataset, it is more
informative to conduct a comparison of the two algorithms on this
much larger dataset (200GB). We experimented with VW using
k = 2^5 to 2^14 hash bins (sample size). Note that 2^14 = 16384.
It is difficult to train LIBLINEAR with k = 2^15 because the
training size of the hashed data by VW is close to 48 GB when k = 2^15.
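
For readers unfamiliar with the VW hashing step, the following is a minimal sketch of hashing a sparse feature vector into k bins with random signs. It is an illustration under stated assumptions, not the VW implementation: the function name `vw_hash`, the per-index seeding trick, and the example vector are all hypothetical.

```python
import random


def vw_hash(features, k, seed=0):
    """Hash a sparse vector {index: value} into k bins with random signs.

    Each original index is mapped to one bin and one sign; values that
    collide in the same bin are added together.
    """
    out = [0.0] * k
    for idx, val in features.items():
        # Illustrative per-index pseudo-random draw; a real system would
        # use a fast hash function instead of seeding a generator.
        rng = random.Random(idx * 1000003 + seed)
        bin_id = rng.randrange(k)
        sign = 1.0 if rng.random() < 0.5 else -1.0
        out[bin_id] += sign * val
    return out


# Hypothetical sparse vector hashed into k = 2**5 bins.
x = {3: 1.0, 17: 2.0, 1000003: -1.5}
print(vw_hash(x, 1 << 5))
```

Unlike the b-bit expansion, the hashed vector here is real-valued and its number of nonzeros depends on the data, which is one reason the hashed data can become large as k grows.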

Figure 10 and Figure 11 plot the test accuracies for SVM and
logistic regression, respectively. In each figure, every panel has the
same set of solid curves for VW but a different set of dashed curves
for different values of b in b-bit minwise hashing. Since the range
of k is very large, here we choose to present the test accuracies
against k. Representative C values (0.01, 0.1, 1, 10) are selected
for the presentations.

From Figures 10 and 11 we can see clearly that b-bit minwise
hashing is substantially more accurate than VW at the same storage.
In other words, in order to achieve the same accuracy, VW will
require substantially more storage than b-bit minwise hashing.

Figure 10: SVM test accuracy on rcv1 for comparing VW
(solid) with b-bit minwise hashing (dashed). Each panel plots
the same results for VW and results for b-bit minwise hashing
for a different b. We select C = 0.01, 0.1, 1, 10.

Figure 11: Logistic regression test accuracy on rcv1 for
comparing VW with b-bit minwise hashing.

Figure 12: Training time for SVM (left) and logistic regression
(right) on rcv1 for comparing VW with 8-bit minwise hashing.

Figure 12 presents the training times for comparing VW with
8-bit minwise hashing. In this case, we can see that even at the same
k, 8-bit hashing may have some computational advantages
compared to VW. Of course, since VW requires a much
larger k in order to achieve the same accuracies as 8-bit minwise
hashing, the advantage of b-bit minwise hashing in
terms of training time reduction is also enormous.

Our comparison focuses on the VW hashing algorithm, not the
VW online learning platform. The prior work [10] experimented
with the VW online learning platform on the webspam dataset and
reported an accuracy of 98.42% (compared to > 99.5% in our
experiments with b-bit hashing) after 597 seconds of training.

6. ONLINE LEARNING 

Batch learning algorithms (e.g., the LIBLINEAR package) face
a challenge when the data do not fit in memory. In the context of
search, the training datasets often far exceed the memory capacity
of a single machine. One common solution (e.g., [10, 39]) is
to partition the data into blocks, which are repeatedly loaded into
memory, to update the model coefficients. However, this does not
solve the computational bottleneck because loading the data blocks
for many iterations consumes a large number of disk I/Os. b-bit
minwise hashing provides a simple solution for high-dimensional
data by substantially reducing the data size.

Another increasingly popular solution is online learning, which
requires only loading one feature vector at a time. Here, we follow
the notational convention used in the online learning literature (and
the SGD code [4]). Given a training set {x_i, y_i}, i = 1, ..., n, we
consider the following linear SVM optimization problem:

$$ \min_{w} \; \frac{\lambda}{2} w^T w + \frac{1}{n} \sum_{i=1}^{n} \max\{1 - y_i w^T x_i, \; 0\} \qquad (11) $$

That is, we replace the parameter C in the batch linear SVM
formulation by λ = 1/(nC).
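
The relation between C and λ can be checked with a one-line derivation. The sketch below assumes that the batch formulation is the standard L2-regularized hinge-loss problem with penalty parameter C (as in LIBLINEAR); dividing that objective by the positive constant nC does not change its minimizer.

```latex
\frac{1}{nC}\left(\frac{1}{2} w^T w
  + C \sum_{i=1}^{n} \max\{1 - y_i w^T x_i,\; 0\}\right)
= \frac{1}{2nC}\, w^T w
  + \frac{1}{n} \sum_{i=1}^{n} \max\{1 - y_i w^T x_i,\; 0\},
\qquad\text{so}\qquad \lambda = \frac{1}{nC}.
```

Under this assumption, a large C in the batch setting corresponds to a small λ in the online setting, and the correspondence depends on the training set size n.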

Stochastic gradient descent (SGD) [4, 5] is a very popular
algorithm for online learning. We modified Bottou's SGD code [4]
to load one data point at a time. Basically, when the new data point
{x_t, y_t} arrives, the weights w are updated by a (sub)gradient
step on the corresponding term of (11):

$$ w \leftarrow \begin{cases} w - \eta_t \lambda w & \text{if } y_t w^T x_t > 1 \\ w - \eta_t (\lambda w - y_t x_t) & \text{otherwise} \end{cases} \qquad (12) $$

where η_t is the learning rate. In Bottou's SGD code, η_t is initially
set by a careful calibration step using a (small) subset of the
examples and is updated at every new data point.
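
The per-example update can be sketched in a few lines. This is a minimal illustration, not Bottou's code: the function name `sgd_svm_epoch`, the decay schedule η_t = η_0 / (1 + λ η_0 t), and the toy data are assumptions, and the calibration step mentioned above is not reproduced.

```python
def sgd_svm_epoch(data, w, lam, eta0, t0=0):
    """One pass of SGD on (lam/2)*||w||^2 + average hinge loss.

    data: list of (x, y), x a sparse dict {index: value}, y in {-1, +1}.
    w: dense weight list, updated in place.
    Returns the weights and the updated step counter.
    """
    t = t0
    for x, y in data:
        eta = eta0 / (1.0 + lam * eta0 * t)      # assumed decay schedule
        margin = y * sum(w[i] * v for i, v in x.items())
        # Regularization part of the step: w <- (1 - eta*lam) * w.
        # (Done densely here for clarity; this costs O(dim) per example.)
        scale = 1.0 - eta * lam
        for i in range(len(w)):
            w[i] *= scale
        # Hinge-loss part, applied only when the loss is active.
        if margin <= 1.0:
            for i, v in x.items():
                w[i] += eta * y * v
        t += 1
    return w, t


# Toy separable data (hypothetical), two features.
toy = [({0: 1.0}, 1), ({0: -1.0}, -1), ({1: 1.0}, 1), ({1: -1.0}, -1)]
weights, steps = sgd_svm_epoch(toy, [0.0, 0.0], lam=0.01, eta0=0.5)
print(weights, steps)
```

Each call touches every example once, so the cost of an epoch on disk-resident data is dominated by how quickly the examples can be read, which is the point developed in the rest of this section.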

It is often observed that the test accuracy improves with an
increasing number of epochs (one epoch means one pass over the data).
For example, Figure 13 shows that we need about 60 epochs for both
the webspam and rcv1 datasets to reach (fairly) stable predictions.




[Figure 13 plots; x-axis: epochs.]

Figure 13: SGD SVM test accuracies on the original webspam
(left panel) and rcv1 (right panel) datasets. The results are
somewhat sensitive to λ. At the best λ values, the test
accuracies increase with increasing number of epochs.

6.1 SGD SVM Results on Webspam 

Figure 14 presents the test accuracies versus the regularization
parameter λ, at the last (100th) epoch. When b > 8 and k > 200,
using b-bit hashing can achieve similar test accuracies as using
the original data. Figure 15 illustrates the test accuracies versus
epochs, for two selected λ values. Perhaps 20 epochs are enough
to reach adequate accuracy using b-bit hashing.

Figure 14: Test accuracies of SGD SVM on webspam at the
100th epoch, for both the original data and the b-bit hashed
data. When k > 200 and b > 8, b-bit minwise hashing achieves
similar accuracies as using the original data.

Figure 16 plots the training time and loading time for both the
original data and 8-bit minwise hashing. On average,
using the original data takes about 10 times more time than using
8-bit hashed data, as reflected in Table 4. Also, the data
loading time clearly dominates the cost.

Because the data loading time dominates the cost of online learn- 
ing, in our SGD experiments, we always first converted the data 
into binary format as opposed to the LIBSVM format used in our 
batch learning experiments. All the reported data loading times in 
this section were based on binary data. We would like to thank 
Leon Bottou for the highly helpful communications in this matter. 
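
To make the binary-format point concrete, the sketch below writes and reads 8-bit hashed data as raw bytes instead of text. The layout (a header with n and k, then one label byte and k value bytes per example) and the function names are hypothetical; they are not the format used by the SGD code or in our experiments.

```python
import io
import struct


def write_hashed(stream, labels, hashed_rows):
    """Write n examples: header (n, k), then per example a signed
    label byte followed by k unsigned bytes (one per 8-bit value)."""
    n, k = len(hashed_rows), len(hashed_rows[0])
    stream.write(struct.pack("<II", n, k))
    for y, row in zip(labels, hashed_rows):
        stream.write(struct.pack("<b", y))
        stream.write(bytes(row))        # raises if a value exceeds 8 bits


def read_hashed(stream):
    """Read back the examples written by write_hashed."""
    n, k = struct.unpack("<II", stream.read(8))
    labels, rows = [], []
    for _ in range(n):
        labels.append(struct.unpack("<b", stream.read(1))[0])
        rows.append(list(stream.read(k)))
    return labels, rows


# Round trip through an in-memory buffer (stands in for a file on disk).
buf = io.BytesIO()
write_hashed(buf, [1, -1], [[0, 255, 7], [9, 8, 200]])
buf.seek(0)
print(read_hashed(buf))
```

With this layout each example occupies exactly k + 1 bytes when b = 8, and no text parsing is needed at load time.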

Figure 15: Test accuracies of SGD SVM on webspam versus
epochs for two selected λ values.

Figure 16: Training time and data loading time on webspam
versus epochs, for both the original data (dashed curves) and
8-bit hashed data (solid curves, for k = 500, 200, 100). b-bit
hashing substantially reduces the training and loading times.

6.2 SGD SVM Results on Rcv1

Figure 17 presents the test accuracies of SGD SVM on rcv1 at
the 100th epoch, for both the original data and the b-bit (b = 8
and b = 12) hashed data. When k > 500 and b = 12, b-bit
minwise hashing achieves similar accuracies as using the original
data. Figure 18 presents the training time and data loading time.
As explicitly calculated in Table 4, using the original data costs 30
times more time than using 12-bit minwise hashing. Again, the
data loading time dominates the cost.

Figure 17: Test accuracies of SGD SVM on rcv1 at the 100th
epoch, for both the original data and the b-bit (b = 8 and b = 12)
hashed data. When k > 500 and b = 12, b-bit minwise
hashing achieves similar accuracies as using the original data.

Figure 18: Training time and data loading time on rcv1 versus
epochs, for both the original data (dashed red curves) and
12-bit hashed data (solid curves, for k = 1000, 500, 200). b-bit
hashing substantially reduces the training and loading times.

6.3 Averaged SGD (ASGD) 

In the course of writing this paper in 2011, Leon Bottou kindly
informed us that there was a recent major upgrade of the SGD code,
which implemented ASGD (Averaged SGD) [38]. Thus, we also
provide experiments of ASGD on the webspam dataset, as shown in
Figure 19.

Table 4: Time ratios for webspam and rcv1, as in Figures 16
and 18, averaged over 100 epochs. For example, the entry 10.05
means training with the original webspam data on average
requires 10.05 times as much time as using 8-bit minwise hashing.

Figure 19: Test accuracies of ASGD SVM on the webspam
dataset. The upper left panel is only for the original data
(accuracies versus epochs). All other panels are the test accuracies
versus λ, at the 100th epoch.

Compared with the SGD results, it appears that ASGD does have
some noticeable improvements over SGD. Nevertheless, ASGD
still needs more than 1 epoch (perhaps 10 to 20) to approach the
best accuracy. Also, b-bit hashing continues to perform very well
in terms of accuracy and training time reduction.
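
The idea behind averaged SGD can be sketched as ordinary SGD plus a running average of the iterates, with the averaged weights used for prediction. The code below is a simplified illustration only: the function name `asgd_svm`, the learning-rate schedule, and averaging from the very first step are assumptions, and the upgraded SGD code [38] may differ in all of these.

```python
def asgd_svm(data, dim, lam, eta0, epochs):
    """SGD for the regularized hinge loss, plus a running average of w.

    data: list of (x, y), x a sparse dict {index: value}, y in {-1, +1}.
    Returns (last iterate, averaged iterate).
    """
    w = [0.0] * dim
    w_avg = [0.0] * dim
    t = 0
    for _ in range(epochs):
        for x, y in data:
            eta = eta0 / (1.0 + lam * eta0 * t)   # assumed schedule
            margin = y * sum(w[i] * v for i, v in x.items())
            w = [(1.0 - eta * lam) * wi for wi in w]
            if margin <= 1.0:                     # hinge loss active
                for i, v in x.items():
                    w[i] += eta * y * v
            t += 1
            # Running mean of all iterates seen so far.
            w_avg = [a + (wi - a) / t for a, wi in zip(w_avg, w)]
    return w, w_avg


# Toy separable data (hypothetical), two features.
toy_data = [({0: 1.0}, 1), ({0: -1.0}, -1), ({1: 1.0}, 1), ({1: -1.0}, -1)]
last_w, avg_w = asgd_svm(toy_data, dim=2, lam=0.01, eta0=0.5, epochs=5)
print(last_w, avg_w)
```

Averaging changes only the per-example arithmetic, not the number of examples that must be read per epoch, so the data loading cost discussed above applies to ASGD as well.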

7. CONCLUSION 

(b-bit) minwise hashing is a standard technique for similarity
computation which has also recently been shown [27] to be a valuable
data reduction technique in (batch) machine learning, where it
can reduce the computational overhead as well as the required
infrastructure and energy consumption by orders of magnitude, often
at a negligible reduction in learning accuracy.

However, the use of b-bit minwise hashing on truly large learning
datasets, which frequently occur in the context of search, requires
study of a number of related challenges. First, datasets with very
large numbers of features make it impossible to use pre-computed
permutation matrices for the permutation step, due to prohibitive
storage requirements. Second, for very large data, the initial
preprocessing phase, during which minhash signatures are computed,
consumes significant resources. Finally, while the technique
has been successfully applied in the context of batch learning (on
a fairly small dataset), its efficacy in the context of online learning
algorithms (which are becoming increasingly popular in the
context of web search and advertising) has not been shown.

In the context of duplicate detection (which normally concerns
only highly similar pairs of documents) using minwise hashing
with 64 bits per hashed value, prior studies (e.g., [6])
demonstrated that it would be sufficient to use k ≈ 200
permutations. However, b-bit minwise hashing (for small values of b)
does require more permutations than the original minwise hashing,
as explained in [26]: for example, k must be increased by a factor of
3 when using b = 1 and the resemblance threshold is R = 0.5.
In the context of machine learning and b-bit minwise hashing, we
have also found that in some datasets k has to be fairly large, e.g.,
k = 500 or even more. This is because machine learning
algorithms use all similarities, not just highly similar pairs.
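
The factor of 3 quoted above can be reproduced with a short calculation. The sketch assumes the sparse-data limit of the b-bit theory in [26], in which the collision probability is P_b = 2^-b + (1 - 2^-b) R and the estimator variance is P_b (1 - P_b) / (k (1 - 2^-b)^2); the original minwise estimator has variance R (1 - R) / k. The function name is illustrative.

```python
def bbit_variance_ratio(b, R):
    """Var(b-bit estimator) / Var(original minwise estimator) at equal k,
    in the sparse-data limit (an assumption of this sketch)."""
    c = 2.0 ** (-b)
    p = c + (1.0 - c) * R                     # b-bit collision probability
    var_b = p * (1.0 - p) / (1.0 - c) ** 2    # times 1/k
    var_m = R * (1.0 - R)                     # times 1/k
    return var_b / var_m


print(bbit_variance_ratio(1, 0.5))  # 3.0
```

Since both variances scale as 1/k, this ratio is also the factor by which k must grow for b-bit hashing to match the accuracy of the original scheme at that R, while each stored value shrinks from 64 bits to b bits.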

In this paper, we addressed all of these challenges to the adoption
of b-bit minwise hashing in the context of Web-scale learning tasks.
Regarding the first challenge, we were able to show that the use
of 2- and 4-universal hash functions matches the accuracy of fully
random permutations, while enabling the use of (b-bit) minwise
hashing even for data with extremely large numbers of features.

Regarding the second challenge, we were able to formulate an
implementation of the minwise hashing algorithm that effectively
leverages the properties of current GPU architectures, in particular
their massive parallelism and SIMD instruction processing, while
minimizing the impact of their constraints (most notably slow modulo
operations and the limited bandwidth available for main memory
data transfer). We observed that the new GPU-based implementation
resulted in speed-ups of 20x to 80x for the minhash
computation, thereby making the data loading time itself (and not
the preprocessing) the new bottleneck.
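
One well-known way to avoid slow modulo operations in a 2U hash is to pick a Mersenne prime modulus and a power-of-two D, so that both reductions become shifts and masks. The sketch below illustrates the arithmetic in Python; it is not the GPU kernel, and the choice p = 2^31 - 1, the hash form (a1 + a2 t mod p) mod D, and the function names are assumptions made for illustration.

```python
P31 = (1 << 31) - 1   # Mersenne prime 2^31 - 1 (assumed modulus)


def mod_mersenne31(x):
    """Compute x mod (2^31 - 1) for 0 <= x < 2^62 without division.

    Uses 2^31 = 1 (mod p): the high bits can be added to the low bits.
    Two folds suffice for inputs below 2^62.
    """
    x = (x & P31) + (x >> 31)
    x = (x & P31) + (x >> 31)
    return 0 if x == P31 else x


def hash_2u(t, a1, a2, D):
    """2U-style hash ((a1 + a2*t) mod p) mod D, with D a power of two.

    Requires 0 <= t, a1, a2 <= 2^31 - 1 so that a1 + a2*t < 2^62.
    """
    return mod_mersenne31(a1 + a2 * t) & (D - 1)


print(hash_2u(12345, 7, 11, 1 << 16))
```

The same two-fold reduction maps directly onto integer shift, mask, and add instructions, which is why it is attractive on hardware where integer division is expensive.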

Finally, we were able to show that, similarly to batch learning,
b-bit minwise hashing can dramatically reduce the resource
requirements for online learning as well, with little reduction in accuracy.
However, for online learning, the reduction was mainly due to the 
reduction in data loading time, which becomes a major factor when 
online learning requires a large number of epochs to converge. A 
side-effect of the use of models leveraging b-bit minwise hash- 
ing is that the space requirements of the resulting model itself are 
also dramatically reduced, which is important in the context of web 
search, where an incoming search query may trigger various mod- 
els for query classification, vertical selection, etc. and all of these 
models compete for memory on the user-facing servers. 
APPENDIX

A. RESEMBLANCE ESTIMATION USING 
SIMPLE HASH FUNCTIONS 

In this section we study the effect of using 2U/4U hash functions
in place of (fully) random permutation matrices on the accuracy
of resemblance estimation via b-bit minwise hashing. This
will provide a better understanding of why the learning results (for
SVM and logistic regression) using b-bit minwise hashing are not
noticeably affected by replacing the fully random permutation
matrix with 2U/4U hash functions. As we shall see, as long
as the original data are not too dense, using 2U/4U hash functions
will not result in loss of estimation accuracy. As we observed that
results from 2U and 4U are essentially indistinguishable, we only
report the 2U experiments.

The task we study here is the estimation of word associations. 
The dataset, extracted from commercial Web crawls, consists of 9 
pairs of sets (18 English words). Each set consists of the IDs of
the documents that contain the word at least once. Table 5
summarizes the data.

Table 5: Data information of the 9 pairs of English words. For
example, "KONG" and "HONG" correspond to the two sets of document
IDs which contained word "KONG" and word "HONG" respectively.

Word 1    Word 2       f1       f2       R
KONG      HONG         948      940      0.925
RIGHTS    RESERVED     12234    11272    0.877
OF        AND          37339    36289    0.771
GAMBIA    KIRIBATI     206      186      0.712
SAN       FRANCISCO    3194     1651     0.476
CREDIT    CARD         2999     2697     0.285
TIME      JOB          37339    36289    0.128
LOW       PAY          2936     2828     0.112
A         TEST         39063    2278     0.052



We implemented both 2U hash l llOt and 4U hash schemes, for 
D = 2^16, 2^18, 2^20, 2^22, 2^24, 2^26, 2^28, 2^30, 2^32. Note that
D ≥ 2^16 is necessary for this dataset. After a sufficient number of
repetitions, we computed the simulated mean square error
(MSE = Var + Bias^2) for each case, to compare with the theoretical
variance (Eq. (11) in [26]), which was derived by assuming perfect
random permutations. Ideally, the empirical MSEs and the theoretical
variances should overlap. Indeed, we observe this is always the case
when D ≥ 2^20. This is the reason why we only plot the results for
D ≤ 2^20 in Figures 20 to 22. In fact, as shown in Figure 20, when
the data are not dense (e.g., KONG-HONG, GAMBIA-KIRIBATI,
SAN-FRANCISCO), using 2U can achieve very similar results as
using perfect random permutations, even at the smallest D = 2^16.
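
The kind of simulation described above can be sketched on synthetic data. The example below covers only the original minwise estimator (the "M" curves), whose variance under perfectly random permutations is R (1 - R) / k, and compares it with the empirical MSE obtained with 2U hashing. The synthetic sets, the prime p = 2^31 - 1, the parameter values, and the function names are assumptions; the word data of Table 5 are not used.

```python
import random

P = (1 << 31) - 1   # prime modulus for the 2U hash (assumed choice)


def minhash_2u(s, a1, a2, D):
    """Smallest hashed value of set s under h(t) = ((a1 + a2*t) mod P) mod D."""
    return min(((a1 + a2 * t) % P) % D for t in s)


def simulate_mse(s1, s2, k, D, reps, rng):
    """Empirical MSE of the minwise resemblance estimate with 2U hashing,
    returned together with the theoretical variance R(1-R)/k."""
    R = len(s1 & s2) / len(s1 | s2)
    total = 0.0
    for _ in range(reps):
        matches = 0
        for _ in range(k):
            a1 = rng.randrange(P)
            a2 = rng.randrange(1, P)
            if minhash_2u(s1, a1, a2, D) == minhash_2u(s2, a1, a2, D):
                matches += 1
        total += (matches / k - R) ** 2
    return total / reps, R * (1.0 - R) / k


# Two sparse synthetic sets in a universe of size D = 2^20 with R = 0.5.
rng = random.Random(0)
D = 1 << 20
items = rng.sample(range(D), 80)
common = set(items[:40])
s1 = common | set(items[40:60])
s2 = common | set(items[60:80])

mse, theory = simulate_mse(s1, s2, k=20, D=D, reps=300, rng=rng)
print(mse, theory)
```

For sparse sets such as these, the two printed numbers should be of similar size; repeating the experiment with much denser sets relative to D is the regime in which deviations would be expected.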



Practical Implication: In practice, we expect the data vectors to
be very sparse for a large number of applications, especially the
many search-related tasks where features correspond to the
presence/absence of text n-grams. For these tasks, the large number of
distinct words (e.g., [18] reports 38M distinct 1-grams in an early
Wikipedia corpus) and the much smaller number of terms in
individual documents combine to cause this property. Therefore, we
expect that 2U/4U hash functions will perform well when used for
b-bit minwise hashing, as verified in the main body of the paper.







[Figure 20 panels (e.g., OF-AND, 2U, D = 2^16); x-axis: sample size k.]

Figure 20: Mean square errors (MSEs) of the resemblance
estimates using (4) and 2U hashing with D = 2^16, on the 9 English
word vector pairs in Table 5. We present b = 1, 2, 4 and the
original minwise hashing (i.e., "M"). The dashed curves are the
theoretical variances (Eq. (11) in [26]). Ideally the solid and
dashed curves should overlap (e.g., KONG-HONG). Due to
limited randomness, when the data are fairly dense (e.g., OF-AND),
the empirical estimates deviate from theoretical predictions.




Figure 21: Mean square errors (MSEs) of the resemblance
estimates using (3) and 2U hashing with D = 2^18, on 6 English
word vector pairs which do not perform too well with D = 2^16
in Figure 20. We can see that the results become much better.




Figure 22: Mean square errors (MSEs) of the resemblance
estimates using (4) and 2U hashing with D = 2^20, on 3 English
word vector pairs which do not perform too well with D = 2^18
in Figure 21. We can see that all the dashed curves (theoretical)
now match the solid curves (empirical).



